MDP Optimal Control under Temporal Logic Constraints 

- Technical Report - 

Xu Chu Ding Stephen L. Smith Calin Belta Daniela Rus 



Abstract — In this paper, we develop a method to automati- 
cally generate a control policy for a dynamical system modeled 
as a Markov Decision Process (MDP). The control specification 
is given as a Linear Temporal Logic (LTL) formula over a set of 
propositions defined on the states of the MDP. We synthesize a 
control policy such that the MDP satisfies the given specification 
almost surely, if such a policy exists. In addition, we designate 
an "optimizing proposition" to be repeatedly satisfied, and we 
formulate a novel optimization criterion in terms of minimizing 
the expected cost in between satisfactions of this proposition. 
We propose a sufficient condition for a policy to be optimal, and 
develop a dynamic programming algorithm that synthesizes a 
policy that is optimal under some conditions, and sub-optimal 
otherwise. This problem is motivated by robotic applications 
requiring persistent tasks, such as environmental monitoring 
or data gathering, to be performed. 

I. Introduction 

In this paper, we consider the problem of controlling a (fi- 
nite state) Markov Decision Process (MDP). Such models are 
widely used in various areas including engineering, biology, 
and economics. In particular, in recent results, they have been 
successfully used to model and control autonomous robots 
subject to uncertainty in their sensing and actuation, such 
as for ground robots [1], unmanned aircraft [2] and surgical 
steering needles [3]. 

Several authors [4]-[7] have proposed using temporal log- 
ics, such as Linear Temporal Logic (LTL) and Computation 
Tree Logic (CTL) [8], as specification languages for control 
systems. Such logics are appealing because they have well 
defined syntax and semantics, which can be easily used to 
specify complex behavior. In particular, in LTL, it is possible 
to specify persistent tasks, e.g., "Visit regions A, then B, 
and then C, infinitely often. Never enter B unless coming 
directly from D." In addition, off-the-shelf model checking 
algorithms [8] and temporal logic game strategies [9] can be 
used to verify the correctness of system trajectories and to 
synthesize provably correct control strategies. 

The existing works focusing on LTL assume that a finite 
model of the system is available and the current state can be 
precisely determined. If the control model is deterministic 
(i.e., at each state, an available control enables exactly one 
transition), control strategies from specifications given as 
LTL formulas can be found through simple adaptations of 
off-the-shelf model checking algorithms [10]. If the control 
is non-deterministic (an available control at a state enables 
one of several transitions, and their probabilities are not 
known), the control problem from an LTL specification can 
be mapped to the solution of a Büchi or GR(1) game if the 
specification is restricted to a fragment of LTL [4], [11]. 
If the probabilities of the enabled transitions at each state 
are known, the control problem reduces to finding a control 
policy for an MDP such that a probabilistic temporal logic 
formula is satisfied [12]. 

By adapting methods from probabilistic model-checking 
[12]-[14], we have recently developed frameworks for de- 
riving MDP control policies from LTL formulas [15], which 
is related to a number of other approaches [16], [17] that 
address the problem of synthesizing control policies for 
MDPs subject to LTL satisfaction constraints. In all of the 
above approaches, a control policy is designed to maximize 
the probability of satisfying a given LTL formula. However, 
no attempt has been made so far to optimize the long-term 
behavior of the system, while enforcing LTL satisfaction 
guarantees. Such an objective is often critical in many 
applications, such as surveillance, persistent monitoring, and 
pickup delivery tasks, where a robot must repeatedly visit 
some areas in an environment and the time in between 
revisits should be minimized. 

As far as we know, this paper is the first attempt to 
compute an optimal control policy for a dynamical sys- 
tem modeled as an MDP, while satisfying temporal logic 
constraints. This work begins to bridge the gap between 
our prior work on MDP control policies maximizing the 
probability of satisfying an LTL formula [15] and optimal 
path planning under LTL constraints [18]. We consider LTL 
formulas defined over a set of propositions assigned to the 
states of the MDP. We synthesize a control policy such that 
the MDP satisfies the specification almost surely, if such a 
policy exists. In addition, we minimize the expected cost 
between satisfying instances of an "optimizing proposition." 

The main contribution of this paper is two-fold. First, we 
formulate the above MDP optimization problem in terms 
of minimizing the average cost per cycle, where a cycle 
is defined between successive satisfactions of the optimizing 
proposition. We present a novel connection between this 
problem and the well-known average cost per stage problem. 
Second, we incorporate the LTL constraints and obtain a 
sufficient condition for a policy to be optimal. We present 
a dynamic programming algorithm that under some (heavy) 
restrictions synthesizes an optimal control policy, and a sub- 
optimal policy otherwise. 



The organization of this paper is as follows. In Sec. II 
we provide some preliminaries. We formulate the problem 
in Sec. III and we formalize the connection between the 
average cost per cycle and the average cost per stage problem 
in Sec. IV. In Sec. V we provide a method for incorporating 
LTL constraints. We present a case study illustrating our 
framework in Sec. VI and we conclude in Sec. VII. 

II. Preliminaries 

A. Linear Temporal Logic 

We employ Linear Temporal Logic (LTL) to describe 
MDP control specifications. A detailed description of the 
syntax and semantics of LTL is beyond the scope of this 
paper and can be found in [8], [13]. Roughly, an LTL 
formula is built up from a set of atomic propositions $\Pi$, 
standard Boolean operators $\neg$ (negation), $\vee$ (disjunction), $\wedge$ 
(conjunction), and temporal operators $\bigcirc$ (next), $\mathcal{U}$ (until), 
$\Diamond$ (eventually), $\Box$ (always). The semantics of LTL formulas 
are given over infinite words in $2^{\Pi}$. A word satisfies an 
LTL formula $\phi$ if $\phi$ is true at the first position of the 
word; $\Box\phi$ means that $\phi$ is true at all positions of the word; 
$\Diamond\phi$ means that $\phi$ eventually becomes true in the word; 
$\phi_1 \,\mathcal{U}\, \phi_2$ means that $\phi_1$ has to hold at least until $\phi_2$ is true. 
More expressivity can be achieved by combining the above 
temporal and Boolean operators (more examples are given 
later). An LTL formula can be represented by a deterministic 
Rabin automaton, which is defined as follows. 
Definition II.1 (Deterministic Rabin Automaton). A 
deterministic Rabin automaton (DRA) is a tuple 
$\mathcal{R} = (Q, \Sigma, \delta, q_0, F)$, where (i) $Q$ is a finite set of states; (ii) $\Sigma$ 
is a set of inputs (alphabet); (iii) $\delta : Q \times \Sigma \to Q$ is the 
transition function; (iv) $q_0 \in Q$ is the initial state; and (v) 
$F = \{(L(1), K(1)), \ldots, (L(M), K(M))\}$ is a set of pairs of 
sets of states such that $L(i), K(i) \subseteq Q$ for all $i = 1, \ldots, M$. 

A run of a Rabin automaton $\mathcal{R}$, denoted by $r_{\mathcal{R}} = q_0 q_1 \ldots$, 
is an infinite sequence of states in $\mathcal{R}$ such that for each $i \geq 0$, 
$q_{i+1} \in \delta(q_i, \alpha)$ for some $\alpha \in \Sigma$. A run $r_{\mathcal{R}}$ is accepting if 
there exists a pair $(L, K) \in F$ such that 1) there exists 
$n \geq 0$, such that for all $m \geq n$, we have $q_m \notin L$, and 
2) there exist infinitely many indices $k$ where $q_k \in K$. This 
acceptance condition means that $r_{\mathcal{R}}$ is accepting if for a pair 
$(L, K) \in F$, $r_{\mathcal{R}}$ intersects with $L$ finitely many times and 
$K$ infinitely many times. Note that for a given pair $(L, K)$, 
$L$ can be an empty set, but $K$ is not empty. 
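The acceptance condition can be illustrated on an ultimately periodic ("lasso") run, where a finite prefix is followed by a cycle repeated forever. The sketch below is only an illustration under that assumption; the function name `rabin_accepts_lasso` and the state names are hypothetical and do not come from the paper. For such a run, the states visited infinitely often are exactly the states on the cycle, so the prefix plays no role.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple

State = Hashable
RabinPair = Tuple[Set[State], Set[State]]  # (L, K)


def rabin_accepts_lasso(cycle: Sequence[State], pairs: Iterable[RabinPair]) -> bool:
    """Check Rabin acceptance for a run of the form prefix . cycle^omega.

    Only the states on the repeated cycle occur infinitely often, so the run
    is accepting iff some pair (L, K) has no L-state on the cycle and at
    least one K-state on the cycle.
    """
    inf_states = set(cycle)
    if not inf_states:
        raise ValueError("the repeated cycle must contain at least one state")
    for L, K in pairs:
        avoids_L = not (inf_states & set(L))   # L visited finitely often
        hits_K = bool(inf_states & set(K))     # K visited infinitely often
        if avoids_L and hits_K:
            return True
    return False


# Hypothetical example: one pair with L empty and K = {q2, q3}.
pairs = [(set(), {"q2", "q3"})]
accepting = rabin_accepts_lasso(["q1", "q2"], pairs)
rejecting = rabin_accepts_lasso(["q1"], pairs)
```

General (non-periodic) runs cannot be checked this way; the lasso form is used here only because it makes "infinitely often" a finite test.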

For any LTL formula $\phi$ over $\Pi$, one can construct a DRA 
with input alphabet $\Sigma = 2^{\Pi}$ accepting all and only words 
over $\Pi$ that satisfy $\phi$ (see [19]). We refer readers to [20] 
and references therein for algorithms and to freely available 
implementations, such as [21], to translate an LTL formula 
over $\Pi$ to a corresponding DRA. 

B. Markov Decision Process and probability measure 

Definition II.2 (Labeled Markov Decision Process). A 
labeled Markov decision process (MDP) is a tuple 
$\mathcal{M} = (S, U, P, s_0, \Pi, \mathcal{L}, g)$, where $S = \{1, \ldots, n\}$ is a finite set of 
states; $U$ is a finite set of controls (actions) (with slight abuse 
of notation we also define the function $U(i)$, where $i \in S$ 
and $U(i) \subseteq U$, to represent the available controls at state $i$); 
$P : S \times U \times S \to [0, 1]$ is the transition probability function 
such that for all $i \in S$, $\sum_{j \in S} P(i, u, j) = 1$ if $u \in U(i)$, 
and $P(i, u, j) = 0$ if $u \notin U(i)$; $s_0 \in S$ is the initial state; 
$\Pi$ is a set of atomic propositions; $\mathcal{L} : S \to 2^{\Pi}$ is a labeling 
function; and $g : S \times U \to \mathbb{R}^{+}$ is such that $g(i, u)$ is the 
expected (non-negative) cost when control $u \in U(i)$ is taken 
at state $i$. 

We define a control function $\mu : S \to U$ such that $\mu(i) \in 
U(i)$ for all $i \in S$. An infinite sequence of control functions 
$M = \{\mu_0, \mu_1, \ldots\}$ is called a policy. One can use a policy to 
resolve all non-deterministic choices in an MDP by applying 
the action $\mu_k(s_k)$ at state $s_k$. Given an initial state $s_0$, an 
infinite sequence $r_{\mathcal{M}}^{M} = s_0 s_1 \ldots$ on $\mathcal{M}$ generated under a 
policy $M$ is called a path on $\mathcal{M}$ if $P(s_k, \mu_k(s_k), s_{k+1}) > 0$ 
for all $k$. The index $k$ of a path is called the stage. If $\mu_k = \mu$ 
for all $k$, then we call it a stationary policy and we denote it 
simply as $\mu$. A stationary policy $\mu$ induces a Markov chain 
whose set of states is $S$ and whose transition probability from 
state $i$ to $j$ is $P(i, \mu(i), j)$. 
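As a small illustration of how a stationary policy induces a Markov chain, the sketch below builds the matrix with entries $P(i, \mu(i), j)$ from a dictionary-based MDP. The data layout (a dict keyed by `(state, control)`), the 0-based state indices, and the name `induced_chain` are assumptions made for this example only.

```python
from typing import Dict, List, Tuple

# Assumed layout: P[(i, u)] maps each successor j to P(i, u, j).
Transitions = Dict[Tuple[int, str], Dict[int, float]]


def induced_chain(P: Transitions, policy: List[str], n: int) -> List[List[float]]:
    """Return the n x n matrix with entries P(i, policy[i], j).

    States are indexed 0..n-1 here (the paper uses 1..n).
    """
    rows = []
    for i in range(n):
        dist = P[(i, policy[i])]  # KeyError if the control is unavailable at i
        if abs(sum(dist.values()) - 1.0) > 1e-9:
            raise ValueError(f"probabilities at state {i} do not sum to 1")
        rows.append([dist.get(j, 0.0) for j in range(n)])
    return rows


# Hypothetical two-state MDP with two controls at state 1.
P = {
    (0, "a"): {1: 1.0},
    (1, "a"): {0: 0.5, 1: 0.5},
    (1, "b"): {0: 1.0},
}
P_mu = induced_chain(P, ["a", "a"], n=2)
```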

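We define $\mathrm{Paths}_{\mathcal{M}}^{M}$ and $\mathrm{FPaths}_{\mathcal{M}}^{M}$ as the set of all infinite 
and finite paths of $\mathcal{M}$ under a policy $M$, respectively. We 
can then define a probability measure over the set $\mathrm{Paths}_{\mathcal{M}}^{M}$. 
For a path $r_{\mathcal{M}}^{M} = s_0 s_1 \ldots s_m s_{m+1} \ldots \in \mathrm{Paths}_{\mathcal{M}}^{M}$, the prefix 
of length $m$ of $r_{\mathcal{M}}^{M}$ is the finite subsequence $s_0 s_1 \ldots s_m$. 
Let $\mathrm{Paths}_{\mathcal{M}}^{M}(s_0 s_1 \ldots s_m)$ denote the set of all paths in 
$\mathrm{Paths}_{\mathcal{M}}^{M}$ with the prefix $s_0 s_1 \ldots s_m$. (Note that $s_0 s_1 \ldots s_m$ 
is a finite path in $\mathrm{FPaths}_{\mathcal{M}}^{M}$.) Then, the probability measure 
$\mathrm{Pr}_{\mathcal{M}}^{M}$ on the smallest $\sigma$-algebra over $\mathrm{Paths}_{\mathcal{M}}^{M}$ containing 
$\mathrm{Paths}_{\mathcal{M}}^{M}(s_0 s_1 \ldots s_m)$ for all $s_0 s_1 \ldots s_m \in \mathrm{FPaths}_{\mathcal{M}}^{M}$ is the 
unique measure satisfying: 

$$\mathrm{Pr}_{\mathcal{M}}^{M}\{\mathrm{Paths}_{\mathcal{M}}^{M}(s_0 s_1 \ldots s_m)\} = \prod_{0 \leq k < m} P(s_k, \mu_k(s_k), s_{k+1}).$$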

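Finally, we can define the probability that an MDP $\mathcal{M}$ 
under a policy $M$ satisfies an LTL formula $\phi$. A path $r_{\mathcal{M}}^{M} = 
s_0 s_1 \ldots$ deterministically generates a word $o = o_0 o_1 \ldots$, 
where $o_i = \mathcal{L}(s_i)$ for all $i$. With a slight abuse of notation, 
we denote $\mathcal{L}(r_{\mathcal{M}}^{M})$ as the word generated by $r_{\mathcal{M}}^{M}$. Given an 
LTL formula $\phi$, one can show that the set $\{r_{\mathcal{M}}^{M} \in \mathrm{Paths}_{\mathcal{M}}^{M} : 
\mathcal{L}(r_{\mathcal{M}}^{M}) \models \phi\}$ is measurable. We define: 

$$\mathrm{Pr}_{\mathcal{M}}^{M}(\phi) := \mathrm{Pr}_{\mathcal{M}}^{M}\{r_{\mathcal{M}}^{M} \in \mathrm{Paths}_{\mathcal{M}}^{M} : \mathcal{L}(r_{\mathcal{M}}^{M}) \models \phi\} \tag{1}$$

as the probability of satisfying $\phi$ for $\mathcal{M}$ under $M$. For more 
details about probability measures on MDPs under a policy 
and measurability of LTL formulas, we refer readers to a text 
in probabilistic model checking, such as [13]. 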

III. Problem Formulation 

Consider a weighted MDP $\mathcal{M} = (S, U, P, s_0, \Pi, \mathcal{L}, g)$ and 
an LTL formula $\phi$ over $\Pi$. As proposed in [18], we assume 
that formula $\phi$ is of the form: 

$$\phi = \Box\Diamond\pi \wedge \psi, \tag{2}$$

where the atomic proposition $\pi \in \Pi$ is called the optimizing 
proposition and $\psi$ is an arbitrary LTL formula. In other 
words, $\phi$ requires that $\psi$ be satisfied and $\pi$ be satisfied 
infinitely often. We assume that there exists at least one 
policy $M$ of $\mathcal{M}$ such that $\mathcal{M}$ under $M$ satisfies $\phi$ almost 
surely, i.e., $\mathrm{Pr}_{\mathcal{M}}^{M}(\phi) = 1$ (in this case we simply say $M$ 
satisfies $\phi$ almost surely). 

We let $\mathbb{M}$ be the set of all policies and $\mathbb{M}_{\phi}$ be the set of all 
policies satisfying $\phi$ almost surely. Note that if there exists a 
control policy satisfying $\phi$ almost surely, then there typically 
exist many (possibly an infinite number of) such policies. 

We would like to obtain the optimal policy such that $\phi$ 
is almost surely satisfied, and the expected cost in between 
visiting a state satisfying $\pi$ is minimized. To formalize this, 
we first denote $S_{\pi} = \{i \in S, \pi \in \mathcal{L}(i)\}$ (i.e., the states where 
atomic proposition $\pi$ is true). We say that each visit to set 
$S_{\pi}$ completes a cycle. Thus, starting at the initial state, the 
finite path reaching $S_{\pi}$ for the first time is the first cycle; 
the finite path that starts after the completion of the first 
cycle and ends with revisiting $S_{\pi}$ for the second time is the 
second cycle, and so on. Given a path $r_{\mathcal{M}}^{M} = s_0 s_1 \ldots$, we use 
$C(r_{\mathcal{M}}^{M}, N)$ to denote the cycle index up to stage $N$, which is 
defined as the total number of cycles completed in $N$ stages 
plus 1 (i.e., the cycle index starts with 1 at the initial state). 
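The cycle index can be computed directly from a finite path prefix. The following is a minimal sketch under one stated assumption: a cycle is counted as completed each time the path visits $S_{\pi}$ at a stage $k = 1, \ldots, N$, so the initial state itself never completes a cycle (the text leaves this boundary case implicit). The function names and the dictionary layout for the cost $g$ are hypothetical.

```python
from typing import Dict, Sequence, Set, Tuple


def cycle_index(path: Sequence[int], s_pi: Set[int], N: int) -> int:
    """Cycle index up to stage N: completed cycles in N stages, plus 1.

    Assumption: visits to S_pi at stages 1..N complete a cycle; stage 0 does not.
    """
    completed = sum(1 for s in path[1:N + 1] if s in s_pi)
    return completed + 1


def cost_per_cycle(path: Sequence[int], controls: Sequence[str],
                   g: Dict[Tuple[int, str], float], s_pi: Set[int], N: int) -> float:
    """Accumulated cost over stages 0..N divided by the cycle index.

    This is the quantity inside the expectation of the per-cycle criterion,
    evaluated on a single sample path.
    """
    total = sum(g[(path[k], controls[k])] for k in range(N + 1))
    return total / cycle_index(path, s_pi, N)


# Hypothetical sample path alternating between state 0 (in S_pi) and state 1.
path = [0, 1, 0, 1, 0]
controls = ["a"] * 5
g = {(0, "a"): 1.0, (1, "a"): 2.0}
idx = cycle_index(path, {0}, N=4)
ratio = cost_per_cycle(path, controls, g, {0}, N=4)
```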

The main problem that we consider in this paper is to find 
a policy that minimizes the average cost per cycle (ACPC) 
starting from the initial state $s_0$. Formally, we have: 
Problem III.1. Find a policy $M = \{\mu_0, \mu_1, \ldots\}$, $M \in \mathbb{M}_{\phi}$, 
that minimizes 

$$J(s_0) = \limsup_{N \to \infty} E\left\{ \frac{\sum_{k=0}^{N} g(s_k, \mu_k(s_k))}{C(r_{\mathcal{M}}^{M}, N)} \right\}, \tag{3}$$

where $E\{\cdot\}$ denotes the expected value. 

Prob. III.1 is related to the standard average cost per stage 
(ACPS) problem, which consists of minimizing 

$$J^{s}(s_0) = \limsup_{N \to \infty} E\left\{ \frac{\sum_{k=0}^{N} g(s_k, \mu_k(s_k))}{N} \right\} \tag{4}$$

over $\mathbb{M}$, with the noted difference that the right-hand side 
(RHS) of (4) is divided by the index of stages instead of 
cycles. The ACPS problem has been widely studied in the 
dynamic programming community, without the constraint of 
satisfying temporal logic formulas. 

The ACPC cost function we consider in this paper is 
relevant for probabilistic abstractions and practical appli- 
cations, where the cost of controls can represent the time, 
energy, or fuel required to apply controls at each state. In 
particular, it is a suitable performance measure for persistent 
tasks, which can be specified by LTL formulas. For example, 
in a data gathering mission [18], an agent is required to 
repeatedly gather and upload data. We can assign $\pi$ to the 
data upload locations, and a solution to Prob. III.1 minimizes 
the expected cost in between data uploads. In such cases, 
the ACPS cost function does not translate to a meaningful 
performance criterion. In fact, a policy minimizing (4) may 
even produce an infinite cost in (3). Nevertheless, we will 
make the connection between the ACPS and the ACPC 
problems in Sec. IV. 



Remark III.2 (Optimization Criterion). The optimization 
criterion in Prob. III.1 is only meaningful for specifications 
where $\pi$ is satisfied infinitely often. Otherwise, the limit from 
Eq. (3) is infinite (since $g$ is a positive-valued function) and 
Prob. III.1 has no solution. This is the reason for choosing $\phi$ 
in the form $\Box\Diamond\pi \wedge \psi$ and for only searching among policies 
that almost surely satisfy $\phi$. 

IV. Solving the average cost per cycle problem 

A. Optimality conditions for ACPS problems 

In this section, we recall some known results on the ACPS 
problem, namely finding a policy over $\mathbb{M}$ that minimizes $J^{s}$ 
in (4). The reader interested in more details is referred to 
[22], [23] and references therein. 

Definition IV.1 (Weak Accessibility Condition). An MDP 
$\mathcal{M}$ is said to satisfy the Weak Accessibility (WA) condition 
if there exists $S_r \subseteq S$, such that (i) there exists a stationary 
policy where $j$ is reachable from $i$ for any $i, j \in S_r$, and (ii) 
states in $S \setminus S_r$ are transient under all stationary policies. 

MDP $\mathcal{M}$ is called single-chain (or weakly-communicating) 
if it satisfies the WA condition. If $\mathcal{M}$ satisfies the WA 
condition with $S_r = S$, then $\mathcal{M}$ is called communicating. 
A stationary policy induces a Markov chain with a 
set of recurrent classes. A state that does not belong to any 
recurrent class is called transient. A stationary policy $\mu$ is 
called unichain if the Markov chain induced by $\mu$ contains 
one recurrent class (and a possible set of transient states). If 
every stationary policy is unichain, $\mathcal{M}$ is called unichain. 

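Recall that the set of states of $\mathcal{M}$ is denoted by $\{1, \ldots, n\}$. 
For each stationary policy $\mu$, we use $P_{\mu}$ to denote the 
transition probability matrix: $P_{\mu}(i, j) = P(i, \mu(i), j)$. Define 
vector $g_{\mu}$ where $g_{\mu}(i) = g(i, \mu(i))$. For each stationary 
policy $\mu$, we can obtain a so-called gain-bias pair $(J^{s}_{\mu}, h^{s}_{\mu})$, 
where 

$$J^{s}_{\mu} = P^{*}_{\mu} g_{\mu}, \qquad h^{s}_{\mu} = H_{\mu} g_{\mu}, \tag{5}$$

with 

$$P^{*}_{\mu} = \lim_{N \to \infty} \frac{1}{N} \sum_{k=0}^{N-1} P_{\mu}^{k}, \qquad H_{\mu} = (I - P_{\mu} + P^{*}_{\mu})^{-1} - P^{*}_{\mu}. \tag{6}$$
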

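The vector $J^{s}_{\mu} = [J^{s}_{\mu}(1), \ldots, J^{s}_{\mu}(n)]^{T}$ is such that $J^{s}_{\mu}(i)$ is 
the ACPS starting at initial state $i$ under policy $\mu$. Note that 
the limit in (6) exists for any stochastic matrix $P_{\mu}$, and $P^{*}_{\mu}$ is 
stochastic. Therefore, the lim sup in (4) can be replaced by 
the limit for a stationary policy. Moreover, $(J^{s}_{\mu}, h^{s}_{\mu})$ satisfies 

$$J^{s}_{\mu} = P_{\mu} J^{s}_{\mu}, \qquad J^{s}_{\mu} + h^{s}_{\mu} = g_{\mu} + P_{\mu} h^{s}_{\mu}. \tag{7}$$

By noting that 

$$h^{s}_{\mu} + v^{s}_{\mu} = P_{\mu} v^{s}_{\mu}, \tag{8}$$

for some vector $v^{s}_{\mu}$, we see that $(J^{s}_{\mu}, h^{s}_{\mu}, v^{s}_{\mu})$ is the solution 
of $3n$ linear equations with $3n$ unknowns. 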
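To make the Cesàro limit in (6) concrete, the sketch below approximates $P^{*}_{\mu}$ by averaging the first $N$ powers of $P_{\mu}$ and then forms the gain $J^{s}_{\mu} = P^{*}_{\mu} g_{\mu}$ from (5). This is a numerical illustration only, not the linear-system approach described above; the function names, the truncation length `N`, and the example chain are assumptions. The periodic two-state chain is chosen because its powers do not converge while their running average does.

```python
from typing import List

Matrix = List[List[float]]


def mat_mul(A: Matrix, B: Matrix) -> Matrix:
    """Plain dense product of two square matrices of equal size."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def cesaro_limit(P: Matrix, N: int = 1000) -> Matrix:
    """Approximate P* = lim (1/N) sum_{k=0}^{N-1} P^k by truncating at N."""
    n = len(P)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # P^0
    acc = [[0.0] * n for _ in range(n)]
    for _ in range(N):
        for i in range(n):
            for j in range(n):
                acc[i][j] += power[i][j]
        power = mat_mul(power, P)
    return [[acc[i][j] / N for j in range(n)] for i in range(n)]


def acps_gain(P: Matrix, g: List[float], N: int = 1000) -> List[float]:
    """Gain vector J^s = P* g for the chain P and one-stage costs g."""
    P_star = cesaro_limit(P, N)
    n = len(g)
    return [sum(P_star[i][j] * g[j] for j in range(n)) for i in range(n)]


# Hypothetical periodic chain: the state flips deterministically at every stage.
P_flip = [[0.0, 1.0], [1.0, 0.0]]
gain = acps_gain(P_flip, [1.0, 3.0])
```

For this chain the average cost per stage is the mean of the two stage costs from either initial state, which is what the averaged powers recover.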

It has been shown that there exists a stationary optimal 
policy $\mu^{*}$ minimizing (4) over all policies, where its gain-bias 
pair $(J^{s}, h^{s})$ satisfies the Bellman's equations for average 
cost per stage problems: 

$$J^{s}(i) = \min_{u \in U(i)} \sum_{j=1}^{n} P(i, u, j) J^{s}(j) \tag{9}$$

and 

$$J^{s}(i) + h^{s}(i) = \min_{u \in \bar{U}_i} \left[ g(i, u) + \sum_{j=1}^{n} P(i, u, j) h^{s}(j) \right] \tag{10}$$

for all $i = 1, \ldots, n$, where $\bar{U}_i$ is the set of controls attaining 
the minimum in (9). Furthermore, if $\mathcal{M}$ is single-chain, the 
optimal average cost does not depend on the initial state, 
i.e., $J^{s}_{\mu^{*}}(i) = \lambda$ for all $i \in S$. In this case, (9) is trivially 
satisfied and $\bar{U}_i$ in (10) can be replaced by $U(i)$. Hence, $\mu^{*}$ 
with gain-bias pair $(\lambda \mathbf{1}, h)$ is optimal over all policies if for 
all stationary policies $\mu$ we have: 

$$\lambda \mathbf{1} + h \leq g_{\mu} + P_{\mu} h, \tag{11}$$

where $\mathbf{1} \in \mathbb{R}^{n}$ is a vector of all 1s and $\leq$ is component-wise. 

B. Optimality conditions for ACPC problems 

Now we derive equations similar to (9) and (10) for ACPC 
problems, without considering the satisfaction constraint, i.e., 
we do not limit the set of policies to $\mathbb{M}_{\phi}$ at the moment. We 
consider the following problem: 

Problem IV.2. Given a communicating MDP $\mathcal{M}$ and a set 
$S_{\pi}$, find a policy $\mu \in \mathbb{M}$ that minimizes (3). 

Note that, for reasons that will become clear in Sec. V, 
we assume in Prob. IV.2 that the MDP is communicating. 
However, it is possible to generalize the results in this section 
to an MDP that is not communicating. 

We limit our attention to stationary policies. We will 
show that, similar to the majority of problems in dynamic 
programming, there exist optimal stationary policies, thus it 
is sufficient to consider only stationary policies. For such 
policies, we use the following notion of proper policies, 
which is the same as the one used in stochastic shortest path 
problems (see [22]). 

Definition IV.3 (Proper Policies). We say a stationary policy 
$\mu$ is proper if, under $\mu$, all initial states have positive 
probability to reach the set $S_{\pi}$ in a finite number of stages. 

We denote $J_{\mu} = [J_{\mu}(1), \ldots, J_{\mu}(n)]^{T}$ where $J_{\mu}(i)$ is the 
ACPC in (3) starting from state $i$ under policy $\mu$. If policy 
$\mu$ is improper, then there exist some states $i \in S$ that can 
never reach $S_{\pi}$. In this case, since $g(i, u)$ is positive for all 
$i, u$, we can immediately see that $J_{\mu}(i) = \infty$. We will first 
consider only proper policies. 

Without loss of generality, we assume that $S_{\pi} = 
\{1, \ldots, m\}$ (i.e., states $m+1, \ldots, n$ are not in $S_{\pi}$). Given 
a proper policy $\mu$, we obtain its transition matrix $P_{\mu}$ as 
described in Sec. IV-A. Our goal is to express $J_{\mu}$ in terms of 
$P_{\mu}$, similar to (5) in the ACPS case. To achieve this, we first 
compute the probability that $j \in S_{\pi}$ is the first state visited 
in $S_{\pi}$ after leaving from a state $i \in S$ by applying policy 
$\mu$. We denote this probability by $\widetilde{P}(i, \mu, j)$. We can obtain 
this probability for all $i \in S$ and $j \in S_{\pi}$ by the following 
proposition: 

Proposition IV.4. $\widetilde{P}(i, \mu, j)$ satisfies 

$$\widetilde{P}(i, \mu, j) = \sum_{k=m+1}^{n} P(i, \mu(i), k)\, \widetilde{P}(k, \mu, j) + P(i, \mu(i), j). \tag{12}$$

Proof. From $i$, the next state can either be in $S_{\pi}$ or not. The 
first term in the RHS of (12) is the probability of reaching 
$S_{\pi}$ with $j$ as the first state visited, given that the next state is not in 
$S_{\pi}$. Adding to it the probability that the next state is in $S_{\pi}$ and 
is equal to $j$ gives the desired result. ■ 



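We now define an $n \times n$ matrix $\widetilde{P}_{\mu}$ such that 

$$\widetilde{P}_{\mu}(i, j) = \begin{cases} \widetilde{P}(i, \mu, j) & \text{if } j \in S_{\pi} \\ 0 & \text{otherwise.} \end{cases} \tag{13}$$

We can immediately see that $\widetilde{P}_{\mu}$ is a stochastic matrix, i.e., all 
its rows sum up to 1, or $\sum_{j=1}^{n} \widetilde{P}_{\mu}(i, j) = 1$. More precisely, 
$\sum_{j=1}^{m} \widetilde{P}(i, \mu, j) = 1$ since $\widetilde{P}_{\mu}(i, j) = 0$ for all $j = m+1, \ldots, n$. 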

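Using (12), we can express $\widetilde{P}_{\mu}$ in a matrix equation in 
terms of $P_{\mu}$. To do this, we need to first define two $n \times n$ 
matrices from $P_{\mu}$ as follows: 

$$\overleftarrow{P}_{\mu}(i, j) = \begin{cases} P_{\mu}(i, j) & \text{if } j \in S_{\pi} \\ 0 & \text{otherwise} \end{cases} \tag{14}$$

$$\overrightarrow{P}_{\mu}(i, j) = \begin{cases} P_{\mu}(i, j) & \text{if } j \notin S_{\pi} \\ 0 & \text{otherwise} \end{cases} \tag{15}$$

From Fig. 1 we can see that matrix $P_{\mu}$ is "split" into $\overleftarrow{P}_{\mu}$ 
and $\overrightarrow{P}_{\mu}$, i.e., $P_{\mu} = \overleftarrow{P}_{\mu} + \overrightarrow{P}_{\mu}$. 

Fig. 1. The constructions of $\overleftarrow{P}_{\mu}$ and $\overrightarrow{P}_{\mu}$ from $P_{\mu}$. 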

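Proposition IV.5. If a policy $\mu$ is proper, then matrix 
$I - \overrightarrow{P}_{\mu}$ is non-singular. 

Proof. Since $\mu$ is proper, for every initial state $i \in S$, the 
set $S_{\pi}$ is eventually reached. Because of this, and since 
$\overrightarrow{P}_{\mu}(i, j) = 0$ if $j \in S_{\pi}$, matrix $\overrightarrow{P}_{\mu}$ is transient, i.e., 
$\lim_{k \to \infty} \overrightarrow{P}_{\mu}^{\,k} = 0$. From linear algebra (see, e.g., Ch. 9.4 
of [24]), since $\overrightarrow{P}_{\mu}$ is transient and sub-stochastic, $I - \overrightarrow{P}_{\mu}$ 
is non-singular. ■ 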



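We can then write (12) as the following matrix equation: 

$$\widetilde{P}_{\mu} = \overrightarrow{P}_{\mu} \widetilde{P}_{\mu} + \overleftarrow{P}_{\mu}. \tag{16}$$

Since $I - \overrightarrow{P}_{\mu}$ is invertible, we have 

$$\widetilde{P}_{\mu} = (I - \overrightarrow{P}_{\mu})^{-1} \overleftarrow{P}_{\mu}. \tag{17}$$

Note that (16) and (17) do not depend on the ordering of the 
states of $\mathcal{M}$, i.e., $S_{\pi}$ does not need to be equal to $\{1, \ldots, m\}$. 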

Next, we give an expression for the expected cost of 
reaching $S_{\pi}$ from $i \in S$ under $\mu$ (if $i \in S_{\pi}$, this is the 
expected cost of reaching $S_{\pi}$ again), and denote it as $\widetilde{g}(i, \mu)$. 
Proposition IV.6. $\widetilde{g}(i, \mu)$ satisfies 

$$\widetilde{g}(i, \mu) = \sum_{k=m+1}^{n} P(i, \mu(i), k)\, \widetilde{g}(k, \mu) + g(i, \mu(i)). \tag{18}$$

Proof. The first term of the RHS of (18) is the expected cost 
from the next state if the next state is not in $S_{\pi}$ (if the next 
state is in $S_{\pi}$ then no extra cost is incurred), and the second 
term is the one-step cost, which is incurred regardless of the 
next state. ■ 



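We define $\widetilde{g}_{\mu}$ such that $\widetilde{g}_{\mu}(i) = \widetilde{g}(i, \mu)$, and note that (18) 
can be written as: 

$$\widetilde{g}_{\mu} = \overrightarrow{P}_{\mu} \widetilde{g}_{\mu} + g_{\mu}, \qquad \widetilde{g}_{\mu} = (I - \overrightarrow{P}_{\mu})^{-1} g_{\mu}, \tag{19}$$

where $g_{\mu}$ is defined in Sec. IV-A. 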
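The recursions (12) and (18) can also be evaluated numerically without forming a matrix inverse. The sketch below iterates them as fixed-point updates, which converge for a proper policy because the part of the transition matrix leading outside $S_{\pi}$ is transient. This iterative scheme is a substitute chosen for a dependency-free illustration, not the closed forms (17) and (19); the function name `first_return`, the tolerance, and the example chain are assumptions.

```python
from typing import List, Set, Tuple

Matrix = List[List[float]]


def first_return(P_mu: Matrix, g_mu: List[float], s_pi: Set[int],
                 tol: float = 1e-12, max_iter: int = 100000) -> Tuple[Matrix, List[float]]:
    """Iterate P~ <- P_right P~ + P_left and g~ <- P_right g~ + g_mu.

    P_left keeps the columns of P_mu that lead into S_pi, P_right keeps the
    remaining columns. States are indexed 0..n-1.
    """
    n = len(P_mu)
    left = [[P_mu[i][j] if j in s_pi else 0.0 for j in range(n)] for i in range(n)]
    right = [[P_mu[i][j] if j not in s_pi else 0.0 for j in range(n)] for i in range(n)]
    P_t = [row[:] for row in left]
    g_t = g_mu[:]
    for _ in range(max_iter):
        new_P = [[left[i][j] + sum(right[i][k] * P_t[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
        new_g = [g_mu[i] + sum(right[i][k] * g_t[k] for k in range(n))
                 for i in range(n)]
        delta = max(
            max(abs(new_P[i][j] - P_t[i][j]) for i in range(n) for j in range(n)),
            max(abs(new_g[i] - g_t[i]) for i in range(n)),
        )
        P_t, g_t = new_P, new_g
        if delta < tol:
            break
    return P_t, g_t


# Hypothetical chain: state 0 is in S_pi; state 1 returns to 0 with probability 0.5.
P_tilde, g_tilde = first_return([[0.0, 1.0], [0.5, 0.5]], [1.0, 2.0], {0})
```

In this example every state returns to state 0 with probability one, so the first column of `P_tilde` approaches 1, and the expected costs to reach $S_{\pi}$ approach 5 from state 0 and 4 from state 1.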

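We can now express the ACPC $J_{\mu}$ in terms of $\widetilde{P}_{\mu}$ and 
$\widetilde{g}_{\mu}$. Observe that, starting from $i$, the expected cost of the 
first cycle is $\widetilde{g}_{\mu}(i)$; the expected cost of the second cycle 
is $\sum_{j=1}^{m} \widetilde{P}(i, \mu, j)\, \widetilde{g}_{\mu}(j)$; the expected cost of the third 
cycle is $\sum_{j=1}^{m} \sum_{k=1}^{m} \widetilde{P}(i, \mu, j)\, \widetilde{P}(j, \mu, k)\, \widetilde{g}_{\mu}(k)$; and so on. 
Therefore: 

$$J_{\mu} = \limsup_{C \to \infty} \frac{1}{C} \sum_{k=0}^{C-1} \widetilde{P}_{\mu}^{\,k}\, \widetilde{g}_{\mu}, \tag{20}$$

where $C$ represents the cycle count. Since $\widetilde{P}_{\mu}$ is a stochastic 
matrix, the lim sup in (20) can be replaced by the limit, and 
we have 

$$J_{\mu} = \lim_{C \to \infty} \frac{1}{C} \sum_{k=0}^{C-1} \widetilde{P}_{\mu}^{\,k}\, \widetilde{g}_{\mu} = \widetilde{P}^{*}_{\mu}\, \widetilde{g}_{\mu}, \tag{21}$$

where $P^{*}$ for a stochastic matrix $P$ is defined in (6). 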

We can now make a connection between Prob. IV.2 and the 
ACPS problem. Each proper policy $\mu$ of $\mathcal{M}$ can be mapped 
to a policy $\tilde{\mu}$ with transition matrix $P_{\tilde{\mu}} := \widetilde{P}_{\mu}$ and vector of 
costs $g_{\tilde{\mu}} := \widetilde{g}_{\mu}$, and we have 

$$J_{\mu} = J^{s}_{\tilde{\mu}}. \tag{22}$$

Moreover, we define $h_{\mu} := h^{s}_{\tilde{\mu}}$. Together with $J_{\mu}$, the pair 
$(J_{\mu}, h_{\mu})$ can be seen as the gain-bias pair for the ACPC 
problem. We denote the set of all policies that can be mapped 
to a proper policy as $\mathbb{M}_{\tilde{\mu}}$. We see that a proper policy 
minimizing the ACPC maps to a policy over $\mathbb{M}_{\tilde{\mu}}$ minimizing 
the ACPS. 

The by-product of the above analysis is that, if $\mu$ is proper, 
then $J_{\mu}(i)$ is finite for all $i$, since $\widetilde{P}^{*}_{\mu}$ is a stochastic matrix 
and $\widetilde{g}_{\mu}(i)$ is finite. We now show that, among stationary 
policies, it is sufficient to consider only proper policies. 
Proposition IV.7. Assume $\mu$ to be an improper policy. If $\mathcal{M}$ 
is communicating, then there exists a proper policy $\mu'$ such 
that $J_{\mu'}(i) \leq J_{\mu}(i)$ for all $i = 1, \ldots, n$, with strict inequality 
for at least one $i$. 

Proof. We partition $S$ into two sets of states: $S_{\nrightarrow\pi}$ is the set 
of states in $S$ that cannot reach $S_{\pi}$, and $S_{\rightarrow\pi}$ is the set of 
states that can reach $S_{\pi}$ with positive probability. Since $\mu$ 
is improper and $g(i, u)$ is positive-valued, $S_{\nrightarrow\pi}$ is not empty 
and $J_{\mu}(i) = \infty$ for all $i \in S_{\nrightarrow\pi}$. Moreover, states in $S_{\nrightarrow\pi}$ 
cannot visit $S_{\rightarrow\pi}$ by definition. Since $\mathcal{M}$ is communicating, 
there exist some actions at some states in $S_{\nrightarrow\pi}$ such that, 
if applied, all states in $S_{\nrightarrow\pi}$ can now visit $S_{\pi}$ with positive 
probability and this policy is now proper (all states can now 
reach $S_{\pi}$). We denote this new policy as $\mu'$. Note that this 
does not increase $J_{\mu}(i)$ if $i \in S_{\rightarrow\pi}$ since controls at these 
states are not changed. Moreover, since $\mu'$ is proper, $J_{\mu'}(i) < 
\infty$ for all $i \in S_{\nrightarrow\pi}$. Therefore $J_{\mu'}(i) < J_{\mu}(i)$ for all $i \in 
S_{\nrightarrow\pi}$. ■ 


Using the connection to the ACPS problem, we have: 
Proposition IV.8. The optimal ACPC policy over stationary 
policies is independent of the initial state. 

Proof. We first consider the optimal ACPC over proper 
policies. As mentioned before, if all stationary policies of 
an MDP satisfy the WA condition (see Def. IV.1), then 
the ACPS is equal for all initial states. Thus, we need to 
show that the WA condition is satisfied for all $\tilde{\mu}$. We will 
use $S_{\pi}$ as set $S_r$. Since $\mathcal{M}$ is communicating, for each 
pair $i, j \in S_{\pi}$, $P(i, \mu, j)$ is positive for some $\mu$; therefore, 
from (12), $\widetilde{P}_{\mu}(i, j)$ is positive for some $\mu$ (i.e., $P_{\tilde{\mu}}(i, j)$ is 
positive for some $\tilde{\mu}$), and the first condition of Def. IV.1 is 
satisfied. Since $\mu$ is proper, the set $S_{\pi}$ can be reached from 
all $i \in S$. In addition, $P_{\tilde{\mu}}(i, j) = 0$ for all $j \notin S_{\pi}$. Thus, all 
states $i \notin S_{\pi}$ are transient under all policies $\tilde{\mu} \in \mathbb{M}_{\tilde{\mu}}$, and 
the second condition is satisfied. Therefore the WA condition 
is satisfied and the optimal ACPS over $\mathbb{M}_{\tilde{\mu}}$ is equal for all 
initial states. Hence, the optimal ACPC is the same for all 
initial states over proper policies. Using Prop. IV.7 we can 
conclude the same statement over stationary policies. ■ 

The above result is not surprising, as it mirrors the result 
for a single-chain MDP in the ACPS problem. Essentially, 
transient behavior does not matter in the long run so the 
optimal cost is the same for any initial state. 

Using Bellman's equations (9) and (10), and in particular 
the case when the optimal cost is the same for all initial 
states (11), policy $\tilde{\mu}^{*}$ with the ACPS gain-bias pair $(\lambda \mathbf{1}, h)$ 
satisfying for all $\tilde{\mu} \in \mathbb{M}_{\tilde{\mu}}$: 

$$\lambda \mathbf{1} + h \leq g_{\tilde{\mu}} + P_{\tilde{\mu}} h \tag{23}$$

is optimal. Equivalently, $\mu^{*}$ that maps to $\tilde{\mu}^{*}$ is optimal over 
all proper policies. The following proposition shows that this 
policy is optimal over all policies in $\mathbb{M}$, stationary or not. 
Proposition IV.9. The proper policy $\mu^{*}$ that maps to $\tilde{\mu}^{*}$ 
satisfying (23) is optimal over $\mathbb{M}$. 



Proof. Consider a policy $M = \{\mu_1, \mu_2, \ldots\}$ and assume it to be 
optimal. We first consider the case where $M$ is stationary for each cycle, 
and the policy is $\mu_k$ for the $k$-th cycle. Among this type of 
policies, from Prop. IV.7 we see that if $M$ is optimal, then 
$\mu_k$ is proper for all $k$. In addition, the ACPC of policy $M$ is 
the ACPS with policy $\{\tilde{\mu}_1, \tilde{\mu}_2, \ldots\}$. Since the optimal policy 
of the ACPS is $\tilde{\mu}^{*}$ (stationary), we can conclude that 
if $M$ is stationary in between cycles, then the optimal policy for 
each cycle is $\mu^{*}$ and thus $M = \mu^{*}$. 

Now we assume that $M$ is not stationary for each cycle. 
Since $g(i, u) > 0$ for all $i, u$, and there exists at least one 
proper policy, the stochastic shortest path problem for $S_{\pi}$ 
admits an optimal stationary policy as a solution [22]. Hence, 
for each cycle $k$, the cycle cost can be minimized if a 
stationary policy is used for the cycle. Therefore, a policy 
which is stationary in between cycles is optimal over $\mathbb{M}$, 
which is, in turn, optimal if $M = \mu^{*}$. The proof is complete. ■ 



Unfortunately, it is not clear how to find the optimal policy 
from (23) except by searching through all policies in $\mathbb{M}_{\tilde{\mu}}$. 
This exhaustive search is not feasible for reasonably large 
problems. Instead, we would like equations in the form of 
(9) and (10), so that the optimizations can be carried out 
independently at each state. 

To circumvent this difficulty, we need to express the gain-bias 
pair $(J_{\mu}, h_{\mu})$, which is equal to $(J^{s}_{\tilde{\mu}}, h^{s}_{\tilde{\mu}})$, in terms of $\mu$. 
From (7), we have 

$$J_{\mu} = P_{\tilde{\mu}} J_{\mu}, \qquad J_{\mu} + h_{\mu} = g_{\tilde{\mu}} + P_{\tilde{\mu}} h_{\mu}.$$

By manipulating the above equations, we can now show that 
$J_{\mu}$ and $h_{\mu}$ can be expressed in terms of $\mu$ (analogous to (7)) 
instead of $\tilde{\mu}$ via the following proposition: 
Proposition IV.10. We have 

$$J_{\mu} = P_{\mu} J_{\mu}, \qquad J_{\mu} + h_{\mu} = g_{\mu} + P_{\mu} h_{\mu} + \overrightarrow{P}_{\mu} J_{\mu}. \tag{24}$$

Moreover, we have 

$$(I - \overrightarrow{P}_{\mu}) h_{\mu} + v_{\mu} = P_{\mu} v_{\mu}, \tag{25}$$

for some vector $v_{\mu}$. 

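Proof. We start from (7): 

$$J_{\mu} = P_{\tilde{\mu}} J_{\mu}, \qquad J_{\mu} + h_{\mu} = g_{\tilde{\mu}} + P_{\tilde{\mu}} h_{\mu}. \tag{26}$$

For the first equation in (26), using (17), we have 

$$J_{\mu} = (I - \overrightarrow{P}_{\mu})^{-1} \overleftarrow{P}_{\mu} J_{\mu}$$
$$(I - \overrightarrow{P}_{\mu}) J_{\mu} = \overleftarrow{P}_{\mu} J_{\mu}$$
$$J_{\mu} = (\overrightarrow{P}_{\mu} + \overleftarrow{P}_{\mu}) J_{\mu} = P_{\mu} J_{\mu}.$$

For the second equation in (26), using (17) and (19), we have 

$$J_{\mu} + h_{\mu} = (I - \overrightarrow{P}_{\mu})^{-1} (g_{\mu} + \overleftarrow{P}_{\mu} h_{\mu})$$
$$(I - \overrightarrow{P}_{\mu})(J_{\mu} + h_{\mu}) = g_{\mu} + \overleftarrow{P}_{\mu} h_{\mu}$$
$$J_{\mu} + h_{\mu} = g_{\mu} + P_{\mu} h_{\mu} + \overrightarrow{P}_{\mu} J_{\mu}.$$

Thus, (26) can be expressed in terms of $\mu$ as: 

$$J_{\mu} = P_{\mu} J_{\mu}, \qquad J_{\mu} + h_{\mu} = g_{\mu} + P_{\mu} h_{\mu} + \overrightarrow{P}_{\mu} J_{\mu}.$$

To compute $J_{\mu}$ and $h_{\mu}$, we need an extra equation similar 
to (8). Using (8), we have 

$$h_{\mu} + v_{\mu} = P_{\tilde{\mu}} v_{\mu}$$
$$h_{\mu} + v_{\mu} = (I - \overrightarrow{P}_{\mu})^{-1} \overleftarrow{P}_{\mu} v_{\mu}$$
$$(I - \overrightarrow{P}_{\mu}) h_{\mu} + v_{\mu} = P_{\mu} v_{\mu},$$

which completes the proof. ■ 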



From Prop. IV.10, similar to the ACPS problem, 
$(J_{\mu}, h_{\mu}, v_{\mu})$ can be solved together by a linear system of 
$3n$ equations and $3n$ unknowns. The insight gained when 
simplifying $J_{\mu}$ and $h_{\mu}$ in terms of $\mu$ motivates us to propose 
the following optimality condition for an optimal policy. 
Proposition IV.11. The policy $\mu^{*}$ with gain-bias pair $(\lambda \mathbf{1}, h)$ 
satisfying 

$$\lambda + h(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} P(i, u, j) h(j) + \lambda \sum_{j=m+1}^{n} P(i, u, j) \right] \tag{27}$$

for all $i = 1, \ldots, n$, is the optimal policy minimizing (3) 
over all policies in $\mathbb{M}$. 



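Proof. The optimality condition (27) can be written as: 

$$\lambda \mathbf{1} + h \leq g_{\mu} + P_{\mu} h + \overrightarrow{P}_{\mu} \lambda \mathbf{1}, \tag{28}$$

for all stationary policies $\mu$. 

Note that, given $a, b \in \mathbb{R}^{n}$ and $a \leq b$, if $A$ is an 
$n \times n$ matrix with all non-negative entries, then $Aa \leq Ab$. 
Moreover, given $c \in \mathbb{R}^{n}$, we have $a + c \leq b + c$. 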



Fig. 2. The construction of the product MDP between a labeled MDP and a DRA. In this example, the set of atomic propositions is $\{a, \pi\}$. (a): A 
labeled MDP where the label on top of a state denotes the atomic propositions assigned to the state. The number on top of an arrow pointing from a state 
$s$ to $s'$ is the probability $P(s, u, s')$ associated with a control $u \in U(s)$. The set of states marked by ovals is $S_{\pi}$. (b): The DRA $\mathcal{R}_{\phi}$ corresponding to 
the LTL formula $\phi = \Box\Diamond\pi \wedge \Box\Diamond a$. In this example, there is one set of accepting states $F = \{(L, K)\}$ where $L = \emptyset$ and $K = \{q_2, q_3\}$ (marked by 
double-strokes). Thus, accepting runs of this DRA must visit $q_2$ or $q_3$ (or both) infinitely often. (c): The product MDP $\mathcal{P} = \mathcal{M} \times \mathcal{R}_{\phi}$ where states of $K_{\mathcal{P}}$ 
are marked by double-strokes and states of $S_{\mathcal{P}\pi}$ are marked by ovals. The states with dashed borders are unreachable, and they are removed from $S_{\mathcal{P}}$. 



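From (28) we have 

$$\lambda \mathbf{1} + h \leq g_{\mu} + (\overleftarrow{P}_{\mu} + \overrightarrow{P}_{\mu}) h + \overrightarrow{P}_{\mu} \lambda \mathbf{1}$$
$$(I - \overrightarrow{P}_{\mu})(\lambda \mathbf{1} + h) \leq g_{\mu} + \overleftarrow{P}_{\mu} h. \tag{29}$$

If $\mu$ is proper, then $\overrightarrow{P}_{\mu}$ is a transient matrix (see proof of 
Prop. IV.5), and all of its eigenvalues are strictly inside the 
unit circle. Therefore, we have 

$$(I - \overrightarrow{P}_{\mu})^{-1} = I + \overrightarrow{P}_{\mu} + \overrightarrow{P}_{\mu}^{\,2} + \cdots.$$

Therefore, since all entries of $\overrightarrow{P}_{\mu}$ are non-negative, all 
entries of $(I - \overrightarrow{P}_{\mu})^{-1}$ are non-negative. Thus, continuing 
from (29), and using (17) and (19), we have 

$$\lambda \mathbf{1} + h \leq (I - \overrightarrow{P}_{\mu})^{-1} (g_{\mu} + \overleftarrow{P}_{\mu} h) = g_{\tilde{\mu}} + P_{\tilde{\mu}} h$$

for all proper policies $\mu$ and all $\tilde{\mu} \in \mathbb{M}_{\tilde{\mu}}$. Hence, $\tilde{\mu}^{*}$ satisfies 
(23) and $\mu^{*}$ is optimal over all proper policies. Using Prop. 
IV.9 the proof is complete. ■ 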



We will present an algorithm that uses Prop. IV.11 to find 
the optimal policy in the next section. Note that, unlike (23), 
(27) can be checked for any policy by finding the minimum 
for all states $i = 1, \ldots, n$, which is significantly easier than 
searching over all proper policies. 
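The state-by-state check of the optimality condition in Prop. IV.11 can be sketched as follows. Given a candidate scalar and bias vector, the code evaluates the bracketed expression for every available control, records the minimizing control, and reports the residual between the two sides. The data layout, the 0-based state indices, the function name `acpc_bellman_residuals`, and the toy MDP are all assumptions for illustration; this is a check of a given pair, not the synthesis algorithm of the paper.

```python
from typing import Dict, List, Set, Tuple

Transitions = Dict[Tuple[int, str], Dict[int, float]]  # (i, u) -> {j: P(i, u, j)}
Costs = Dict[Tuple[int, str], float]                   # (i, u) -> g(i, u)


def acpc_bellman_residuals(P: Transitions, g: Costs, controls: Dict[int, List[str]],
                           s_pi: Set[int], lam: float,
                           h: List[float]) -> Tuple[List[float], List[str]]:
    """For each state i, compute lam + h[i] minus the minimum over u of
    g(i,u) + sum_j P(i,u,j) h[j] + lam * sum_{j not in S_pi} P(i,u,j).

    Returns the residuals (all zero when the condition holds) and the
    minimizing control at each state.
    """
    residuals, greedy = [], []
    for i in range(len(h)):
        best_val, best_u = None, None
        for u in controls[i]:
            dist = P[(i, u)]
            val = (g[(i, u)]
                   + sum(p * h[j] for j, p in dist.items())
                   + lam * sum(p for j, p in dist.items() if j not in s_pi))
            if best_val is None or val < best_val:
                best_val, best_u = val, u
        residuals.append(lam + h[i] - best_val)
        greedy.append(best_u)
    return residuals, greedy


# Hypothetical MDP: state 0 is in S_pi; each cycle goes 0 -> 1 -> 0.
P = {(0, "a"): {1: 1.0}, (1, "a"): {0: 1.0}, (1, "b"): {0: 1.0}}
g = {(0, "a"): 1.0, (1, "a"): 2.0, (1, "b"): 5.0}
controls = {0: ["a"], 1: ["a", "b"]}
residuals, greedy = acpc_bellman_residuals(P, g, controls, {0}, lam=3.0, h=[1.0, 0.0])
```

In this toy example the cheapest cycle costs 1 + 2 = 3, and the pair `lam = 3.0`, `h = [1.0, 0.0]` makes both residuals vanish with control `"a"` selected at both states.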

V. Synthesizing the Optimal Policy under LTL Constraints 

In this section we outline an approach for Prob. III.1. We 
aim for a computational framework, which in combination 
with results of [15] produces a policy that both maximizes 
the satisfaction probability and optimizes the long-term 
performance of the system, using results from Sec. IV. 


A. Automata-theoretical approach to LTL control synthesis 

Our approach proceeds by converting the LTL formula 
$\phi$ to a DRA as defined in Def. II.1. We denote the 
resulting DRA as $\mathcal{R}_{\phi} = (Q, 2^{\Pi}, \delta, q_0, F)$ with $F = 
\{(L(1), K(1)), \ldots, (L(M), K(M))\}$ where $L(i), K(i) \subseteq Q$ 
for all $i = 1, \ldots, M$. We now obtain an MDP as the product 
of a labeled MDP $\mathcal{M}$ and a DRA $\mathcal{R}_{\phi}$, which captures all 
paths of $\mathcal{M}$ satisfying $\phi$. 

Definition V.l (Product MDP). The product MDP x 7^0 
between a labeled MDP A4 — {S,U, P, Sq,^, C, g) and a 
DRA TZ^ = {Q,2^,6,qo,F) is obtained from a tuple V = 

{S-p,U,P-p,s-po,F-p,S-pT,,g-p), where 

(i) S-p — S X Q is a set of states; 

( ii) U is a set of controls inherited from A4 and we define 
Uv{is,q)) = Uis); 

(Hi) P-p gives the transition probabilities: 



Pviis,q),u,is',q')) = 



P{s,u,s') ifq'^6{q,C{s)) 
otherwise; 



formance of the system, using results from Sec. IV 



(iv) s-pQ = {sotQo) '■5 the initial state; 
(V) Fv - {{Lvil),Kvil)),...ALv{M),KviM))} 
where L-p{i) — S x Lii), K-p{i) = S x K{i), for 
i^l,...,M; 

( vi) S-ptt is the set of states in S-p for which proposition tt 
is satisfied. Thus, S-p^y ~ 3,^ x Q; 

(vii) gv{{s, q),u) = g{s, u) for all (s, q) E Sv; 

Note that some states of $S_\mathcal{P}$ may be unreachable and
therefore have no control available. After removing those
states (via a simple graph search), $\mathcal{P}$ is a valid MDP and is
the desired product MDP. With a slight abuse of notation we
still denote the product MDP as $\mathcal{P}$ and always assume that
unreachable states are removed. An example of a product
MDP between a labeled MDP and a DRA corresponding to
the LTL formula $\phi = \Box\Diamond\pi \wedge \Box\Diamond a$ is shown in Fig. 2.
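The product construction and the removal of unreachable states can be sketched together by exploring the product only from its initial state. The following is a minimal illustration, not the authors' implementation; the dictionary encodings of $P$, the labeling function and $\delta$, and the names `product_mdp`, `label` and `delta`, are assumptions made for this example.

```python
from collections import deque

def product_mdp(P, label, delta, s0, q0):
    """Build the reachable part of the product of a labeled MDP and a DRA.

    P[s][u]  -> {s_next: probability}  (assumed encoding of P(s, u, s'))
    label[s] -> frozenset of atomic propositions true at s
    delta    -> {(q, label): q_next}   (deterministic DRA transitions)
    """
    init = (s0, q0)
    trans = {}                          # trans[(s, q)][u] = {(s', q'): prob}
    queue = deque([init])
    while queue:
        s, q = queue.popleft()
        if (s, q) in trans:             # already expanded
            continue
        q_next = delta[(q, label[s])]   # the DRA reads the label of the current state
        trans[(s, q)] = {}
        for u, dist in P[s].items():
            trans[(s, q)][u] = {(s2, q_next): p for s2, p in dist.items()}
            queue.extend((s2, q_next) for s2 in dist)
    return init, trans

# Toy data (made up): two MDP states, one control, a two-state DRA.
P = {"s0": {"a": {"s0": 0.5, "s1": 0.5}}, "s1": {"a": {"s0": 1.0}}}
label = {"s0": frozenset(), "s1": frozenset({"pi"})}
delta = {("q0", frozenset()): "q0", ("q0", frozenset({"pi"})): "q1",
         ("q1", frozenset()): "q0", ("q1", frozenset({"pi"})): "q1"}
init, trans = product_mdp(P, label, delta, "s0", "q0")
```

Because the search starts at $(s_0, q_0)$, product states that cannot be reached never appear in `trans`, so no separate pruning pass is needed in this sketch.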

There is a one-to-one correspondence between a path
$s_0 s_1 \ldots$ on $\mathcal{M}$ and a path $(s_0, q_0)(s_1, q_1) \ldots$ on $\mathcal{P}$. Moreover,
from the definition of $g_\mathcal{P}$, the costs along these two
paths are the same. The product MDP is constructed so
that, given a path $(s_0, q_0)(s_1, q_1) \ldots$, the corresponding path
$s_0 s_1 \ldots$ on $\mathcal{M}$ generates a word satisfying $\phi$ if and only if
there exists $(L_\mathcal{P}, K_\mathcal{P}) \in F_\mathcal{P}$ such that the set $K_\mathcal{P}$ is visited
infinitely often and $L_\mathcal{P}$ finitely often.

A similar one-to-one correspondence exists for policies:

Definition V.2 (Inducing a policy from $\mathcal{P}$). Given policy
$M_\mathcal{P} = \{\mu_0^\mathcal{P}, \mu_1^\mathcal{P}, \ldots\}$ on $\mathcal{P}$, where $\mu_k^\mathcal{P}((s, q)) \in U_\mathcal{P}((s, q))$,
it induces policy $M = \{\mu_0, \mu_1, \ldots\}$ on $\mathcal{M}$ by setting
$\mu_k(s_k) = \mu_k^\mathcal{P}((s_k, q_k))$ for all $k$. We denote $M_\mathcal{P}|_\mathcal{M}$ as the
policy induced by $M_\mathcal{P}$, and we use the same notation for a
set of policies.

An induced policy can be implemented on $\mathcal{M}$ by simply
keeping track of its current state on $\mathcal{P}$. Note that a stationary
policy on $\mathcal{P}$ induces a non-stationary policy on $\mathcal{M}$. From
the one-to-one correspondence between paths and the equivalence
of their costs, the expected cost from initial
state $s_0$ under $M_\mathcal{P}|_\mathcal{M}$ is equal to the expected cost from
initial state $(s_0, q_0)$ under $M_\mathcal{P}$.
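As a small illustration of how an induced policy might be executed, the sketch below tracks the DRA state alongside the MDP state and looks up the control on the pair. The function name `run_induced_policy` and the toy labels, transitions and policy are hypothetical and chosen only for this example.

```python
def run_induced_policy(policy, label, delta, q0, states):
    """Return the controls the induced policy applies along an MDP state sequence.

    policy[(s, q)] -> control chosen by a stationary policy on the product MDP
    label[s]       -> frozenset of atomic propositions true at s
    delta          -> {(q, label): q_next}
    """
    q = q0
    controls = []
    for s in states:
        controls.append(policy[(s, q)])  # mu_k(s_k) = mu^P((s_k, q_k))
        q = delta[(q, label[s])]         # advance the DRA on the label of s_k
    return controls

# Toy data (made up): the control at s0 depends on the tracked DRA state.
label = {"s0": frozenset(), "s1": frozenset({"pi"})}
delta = {("q0", frozenset()): "q0", ("q0", frozenset({"pi"})): "q1",
         ("q1", frozenset()): "q0", ("q1", frozenset({"pi"})): "q1"}
policy = {("s0", "q0"): "a", ("s1", "q0"): "a", ("s0", "q1"): "b"}
controls = run_induced_policy(policy, label, delta, "q0", ["s0", "s1", "s0"])
```

In this toy run the MDP state `s0` receives control `a` on the first visit and `b` on the second, which is how a stationary policy on the product can appear non-stationary on the original MDP.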

For each pair of states $(L_\mathcal{P}, K_\mathcal{P}) \in F_\mathcal{P}$, we can obtain a
set of accepting maximal end components (AMEC):

Definition V.3 (Accepting Maximal End Components).
Given $(L_\mathcal{P}, K_\mathcal{P}) \in F_\mathcal{P}$, an end component $C$ is a communicating
MDP $(S_C, U_C, P_C, K_C, S_{C\pi}, g_C)$ such that $S_C \subseteq S_\mathcal{P}$,
$U_C \subseteq U_\mathcal{P}$, $U_C(i) \subseteq U(i)$ for all $i \in S_C$, $K_C = S_C \cap K_\mathcal{P}$,
$S_{C\pi} = S_C \cap S_{\mathcal{P}\pi}$, and $g_C(i, u) = g_\mathcal{P}(i, u)$ if $i \in S_C$,
$u \in U_C(i)$. If $P(i, u, j) > 0$ for any $i \in S_C$ and $u \in U_C(i)$,
then $j \in S_C$, in which case $P_C(i, u, j) = P(i, u, j)$. An
accepting maximal end component (AMEC) is the largest
such end component such that $K_C \neq \emptyset$ and $S_C \cap L_\mathcal{P} = \emptyset$.

Note that an AMEC always contains at least one state in
$K_\mathcal{P}$ and no state in $L_\mathcal{P}$. Moreover, it is "absorbing" in the
sense that the state does not leave an AMEC once entered.
In the example shown in Fig. 2, there exists only one AMEC
corresponding to $(L_\mathcal{P}, K_\mathcal{P})$, which is the only pair of states
in $F_\mathcal{P}$, and the states of this AMEC are shown in Fig. 3.
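The closure and acceptance requirements in Def. V.3 are easy to check for a candidate state set. The sketch below tests only those two requirements; it does not test that the candidate is communicating or maximal, and the function name and data layout are assumptions for illustration.

```python
def is_closed_and_accepting(S_C, U_C, P, K_P, L_P):
    """Check closure under the allowed controls, K_C nonempty, and no state of L_P.

    S_C      -> set of candidate states
    U_C[i]   -> controls allowed at state i inside the candidate
    P[i][u]  -> {j: probability}
    K_P, L_P -> sets of product states from one accepting pair
    """
    closed = all(
        j in S_C
        for i in S_C
        for u in U_C[i]
        for j, p in P[i][u].items()
        if p > 0
    )
    return closed and bool(S_C & K_P) and not (S_C & L_P)

# Toy data (made up): control "b" at state 1 leaves the set {0, 1}.
P = {0: {"a": {1: 1.0}}, 1: {"a": {0: 1.0}, "b": {2: 1.0}}, 2: {"a": {2: 1.0}}}
ok = is_closed_and_accepting({0, 1}, {0: ["a"], 1: ["a"]}, P, K_P={1}, L_P={2})
leaky = is_closed_and_accepting({0, 1}, {0: ["a"], 1: ["a", "b"]}, P, K_P={1}, L_P={2})
```

Restricting the controls at state 1 to `a` keeps the set closed, which mirrors the requirement $U_C(i) \subseteq U(i)$ in the definition.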



Fig. 3. The states of the only AMEC corresponding to the product MDP
in Fig. 2.

A procedure to obtain all AMECs of an MDP was provided
in [13]. From probabilistic model checking, a policy
$M = M_\mathcal{P}|_\mathcal{M}$ almost surely satisfies $\phi$ (i.e., $M \in \mathbb{M}_\phi$) if and
only if, under policy $M_\mathcal{P}$, there exists an AMEC $C$ such that
the probability of reaching $S_C$ from initial state $(s_0, q_0)$ is 1
(in this case, we call $C$ a reachable AMEC). In [15], such
an optimal policy was found by dynamic programming or
solving a linear program. For states inside $C$, since $C$ itself
is a communicating MDP, a policy (not unique) can be easily
constructed such that a state in $K_C$ is infinitely often visited,
satisfying the LTL specification.

B. Optimizing the long-term performance of the MDP 

For a control policy designed to satisfy an LTL formula,
the system behavior outside an AMEC is transient, while the
behavior inside an AMEC is long-term. The policies obtained
in [15]-[17] essentially disregard the behavior inside an
AMEC, because, from the verification point of view, the
behavior inside an AMEC is for the most part irrelevant, as
long as a state in $K_\mathcal{P}$ is visited infinitely often. We now aim
to optimize the long-term behavior of the MDP with respect
to the ACPC cost function, while enforcing the satisfaction
constraint. Since each AMEC is a communicating MDP, we
can use results in Sec. IV-B to help obtain a solution. Our
approach consists of the following steps:

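(i) Convert formula $\phi$ to a DRA $\mathcal{R}_\phi$ and obtain the product
MDP $\mathcal{P}$ between $\mathcal{M}$ and $\mathcal{R}_\phi$;

(ii) Obtain the set of reachable AMECs, denoted as $\mathcal{A}$;

(iii) For each $C \in \mathcal{A}$: Find a stationary policy $\mu^\star_{\to C}(i)$,
defined for $i \in S \setminus S_C$, that reaches $S_C$ with probability
1 ($\mu^\star_{\to C}$ is guaranteed to exist and obtained as in [15]);
Find a stationary policy $\mu^\star_{\circlearrowright C}(i)$, defined for $i \in S_C$,
minimizing the ACPC for MDP $C$ and set $S_{C\pi}$ while satisfying
the LTL constraint; Define $\mu^\star_C$ to be:

\[ \mu^\star_C(i) =
\begin{cases}
\mu^\star_{\to C}(i) & \text{if } i \notin S_C \\
\mu^\star_{\circlearrowright C}(i) & \text{if } i \in S_C,
\end{cases} \qquad (30) \]

and denote the ACPC of $\mu^\star_{\circlearrowright C}$ as $\lambda_C$;

(iv) We find the solution to Prob. III.1 by:

\[ J^\star(s_0) = \min_{C \in \mathcal{A}} \lambda_C, \qquad (31) \]

and the optimal policy is $\mu^\star_{C^\star}|_\mathcal{M}$, where $C^\star$ is the AMEC
attaining the minimum in (31).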
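Steps (iii) and (iv) amount to stitching two policies together for each reachable AMEC and then keeping the AMEC with the smallest ACPC. A minimal sketch follows, assuming the per-AMEC policies and their ACPC values have already been computed; the function names and the numbers are made up for illustration.

```python
def combine_policies(mu_reach, mu_cycle, S_C):
    """Eq. (30): follow mu_reach outside S_C and mu_cycle inside S_C."""
    policy = {i: u for i, u in mu_reach.items() if i not in S_C}
    policy.update({i: mu_cycle[i] for i in S_C})
    return policy

def best_amec(candidates):
    """Eq. (31): candidates maps an AMEC name to (lambda_C, policy)."""
    name = min(candidates, key=lambda c: candidates[c][0])
    return name, candidates[name]

# Toy data (made up): two reachable AMECs over product states 0..3.
pol_1 = combine_policies({0: "a", 1: "a"}, {2: "b", 3: "a"}, S_C={2, 3})
pol_2 = combine_policies({2: "a", 3: "b"}, {0: "b", 1: "b"}, S_C={0, 1})
name, (lam, policy) = best_amec({"C1": (4.0, pol_1), "C2": (2.5, pol_2)})
```

Ties between AMECs are broken arbitrarily by `min` here; the text does not prescribe a tie-breaking rule.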
We now provide the sufficient conditions for a policy $\mu^\star_{\circlearrowright C}$
to be optimal. Moreover, if an optimal policy $\mu^\star_{\circlearrowright C}$ can
be obtained for each $C$, we show that the above procedure
indeed gives the optimal solution to Prob. III.1.



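Proposition V.4. For each $C \in \mathcal{A}$, let $\mu^\star_C$ be constructed
as in (30), where $\mu^\star_{\circlearrowright C}$ is a stationary policy satisfying
two optimality conditions: (i) its ACPC gain-bias pair is
$(\lambda_C \mathbf{1}, h)$, where

\[ \lambda_C + h(i) = \min_{u \in U_C(i)} \Big[ g_C(i, u) + \sum_{j \in S_C} P(i, u, j)\, h(j) + \lambda_C \sum_{j \notin S_{C\pi}} P(i, u, j) \Big], \qquad (32) \]

for all $i \in S_C$, and (ii) there exists a state of $K_C$ in each
recurrent class of $\mu^\star_{\circlearrowright C}$. Then the optimal cost for Prob. III.1
is $J^\star(s_0) = \min_{C \in \mathcal{A}} \lambda_C$, and the optimal policy is $\mu^\star_{C^\star}|_\mathcal{M}$,
where $C^\star$ is the AMEC attaining this minimum.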
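Condition (i) can be checked numerically for a candidate gain and bias. The sketch below reads (32) as: gain plus bias at $i$ equals the minimum over allowed controls of the stage cost, plus the expected bias at the next state, plus the gain times the probability of moving to a state outside $S_{C\pi}$. That reading, the function name and the two-state example are assumptions made for illustration.

```python
def satisfies_gain_bias_condition(lam, h, S_C, S_C_pi, U_C, P, g, tol=1e-9):
    """Check the assumed form of Eq. (32) at every state of S_C."""
    for i in S_C:
        best = min(
            g[i][u]
            + sum(p * h[j] for j, p in P[i][u].items())
            + lam * sum(p for j, p in P[i][u].items() if j not in S_C_pi)
            for u in U_C[i]
        )
        if abs(lam + h[i] - best) > tol:
            return False
    return True

# Toy cycle (made up): 0 -> 1 costs 2, 1 -> 0 costs 3, and state 0 satisfies pi.
# Each return to state 0 closes one cycle, so the cost per cycle is 5.
P = {0: {"a": {1: 1.0}}, 1: {"a": {0: 1.0}}}
g = {0: {"a": 2.0}, 1: {"a": 3.0}}
U_C = {0: ["a"], 1: ["a"]}
good = satisfies_gain_bias_condition(5.0, {0: 2.0, 1: 0.0}, {0, 1}, {0}, U_C, P, g)
bad = satisfies_gain_bias_condition(4.0, {0: 2.0, 1: 0.0}, {0, 1}, {0}, U_C, P, g)
```

For the toy cycle, substituting the two states into the condition gives $h(0) = 2 + h(1)$ and $\lambda + h(1) = 3 + h(0)$, hence $\lambda = 5$, which matches the cost accumulated between consecutive visits to state 0.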

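Proof. Given $C \in \mathcal{A}$, define a set of policies $\mathbb{M}_C$, such that
for each policy in $\mathbb{M}_C$: from initial state $(s_0, q_0)$, (i) $S_C$ is
reached with probability 1, (ii) $S \setminus S_C$ is not visited thereafter,
and (iii) $K_C$ is visited infinitely often. We see that, by the
definition of AMECs, a policy almost surely satisfying $\phi$
belongs to $\mathbb{M}_C|_\mathcal{M}$ for a $C \in \mathcal{A}$. Thus, $\mathbb{M}_\phi = \bigcup_{C \in \mathcal{A}} \mathbb{M}_C|_\mathcal{M}$.

Since $\mu^\star_C(i) = \mu^\star_{\to C}(i)$ if $i \notin S_C$, the state reaches $S_C$ with
probability 1 and in a finite number of stages. We denote the
probability that $j \in S_C$ is the first state visited in $S_C$ when $C$
is reached from initial state $s_{\mathcal{P}0}$ as $\widetilde{P}_C(j, \mu^\star_{\to C}, s_{\mathcal{P}0})$. Since
the ACPC for the finite path from the initial state to a state
$j \in S_C$ is 0 as the cycle index goes to $\infty$, the ACPC from
initial state $s_{\mathcal{P}0}$ under policy $\mu^\star_C$ is

\[ J(s_0) = \sum_{j \in S_C} \widetilde{P}_C(j, \mu^\star_{\to C}, s_{\mathcal{P}0})\, J_{\mu^\star_{\circlearrowright C}}(j). \qquad (33) \]

Since $C$ is communicating, the optimal cost is the same for
all states of $S_C$ (and thus it does not matter which state in
$S_C$ is first visited when $S_C$ is reached). We have

\[ J(s_0) = \sum_{j \in S_C} \widetilde{P}_C(j, \mu^\star_{\to C}, s_{\mathcal{P}0})\, \lambda_C = \lambda_C. \qquad (34) \]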

Applying Prop. IV.11, we see that $\mu^\star_{\circlearrowright C}$ satisfies the optimality
condition for MDP $C$ with respect to set $S_{C\pi}$. Since
there exists a state of $K_C$ in each recurrent class of $\mu^\star_{\circlearrowright C}$, a
state in $K_C$ is visited infinitely often and it satisfies the LTL
constraint. Therefore, $\mu^\star_C$ as constructed in (30) is optimal
over $\mathbb{M}_C$ and $\mu^\star_C|_\mathcal{M}$ is optimal over $\mathbb{M}_C|_\mathcal{M}$ (due to equivalence
of expected costs between $M_\mathcal{P}$ and $M_\mathcal{P}|_\mathcal{M}$). Since
$\mathbb{M}_\phi = \bigcup_{C \in \mathcal{A}} \mathbb{M}_C|_\mathcal{M}$, we have that $J^\star(s_0) = \min_{C \in \mathcal{A}} \lambda_C$ and
the policy corresponding to $C^\star$ attaining this minimum is the
optimal policy. ■

We can relax the optimality conditions for $\mu^\star_{\circlearrowright C}$ in Prop.
V.4 and require that there exist a state $i \in K_C$ in one recurrent
class of $\mu^\star_{\circlearrowright C}$. For such a policy, we can construct a policy
such that it has one recurrent class containing state $i$, with the
same ACPC cost at each state. This construction is identical
to a similar procedure for ACPS problems when the MDP
is communicating (see [22, p. 203]). We can then use (30)
to obtain the optimal policy $\mu^\star_C$ for $C$.

We now present an algorithm (see Alg. 1) that iteratively
updates the policy in an attempt to find one that satisfies the
optimality conditions given in Prop. V.4 for a given $C \in \mathcal{A}$.
Note that Alg. 1 is similar in nature to policy iteration
algorithms for ACPS problems.

Proposition V.5. Given $C$, Alg. 1 terminates in a finite
number of iterations. If it returns policy $\mu_{\circlearrowright C}$ with "optimal",
then $\mu_{\circlearrowright C}$ satisfies the optimality conditions in Prop. V.4.
If $C$ is unichain (i.e., each stationary policy of $C$ contains
one recurrent class), then Alg. 1 is guaranteed to return the
optimal policy $\mu^\star_{\circlearrowright C}$.

Proof. If $C$ is unichain, then since it is also communicating,
$\mu^\star_{\circlearrowright C}$ contains a single recurrent class (and no transient state).
In this case, since $K_C$ is not empty, states in $K_C$ are recurrent
and the LTL constraint is always satisfied at steps 7 and 9
of Alg. 1. The rest of the proof (for the general case and
not assuming $C$ to be unichain) is similar to the proof of
convergence for the policy iteration algorithm for the ACPS
problem (see [22, pp. 237-239]). Note that the proof is the
same except that when the algorithm terminates at step 11 in
Alg. 1, $\mu^k$ satisfies (32) instead of the optimality conditions
for the ACPS problem. ■

Algorithm 1: Policy iteration algorithm for ACPC

Input: $C = (S_C, U_C, P_C, K_C, S_{C\pi}, g_C)$
Output: Policy $\mu_{\circlearrowright C}$
 1: Initialize $\mu^0$ to a proper policy containing $K_C$ in its recurrent
    classes (such a policy can always be constructed
    since $C$ is communicating)
 2: repeat
 3:   Given $\mu^k$, compute $J_{\mu^k}$ and $h_{\mu^k}$ with (24)
 4:   Compute for all $i \in S_C$:
      \[ \bar{U}(i) = \arg\min_{u \in U_C(i)} \sum_{j \in S_C} P(i, u, j)\, J_{\mu^k}(j) \qquad (35) \]
 5:   if $\mu^k(i) \in \bar{U}(i)$ for all $i \in S_C$ then
 6:     Compute, for all $i \in S_C$:
        \[ \bar{M}(i) = \arg\min_{u \in \bar{U}(i)} \Big[ g_C(i, u) + \sum_{j \in S_C} P(i, u, j)\, h_{\mu^k}(j) + \sum_{j \notin S_{C\pi}} P(i, u, j)\, J_{\mu^k}(j) \Big] \qquad (36) \]
 7:     Find $\mu^{k+1}$ such that $\mu^{k+1}(i) \in \bar{M}(i)$ for all $i \in S_C$,
        and contains a state of $K_C$ in its recurrent classes. If
        one does not exist, Return: $\mu^k$ with "not optimal"
 8:   else
 9:     Find $\mu^{k+1}$ such that $\mu^{k+1}(i) \in \bar{U}(i)$ for all $i \in S_C$,
        and contains a state of $K_C$ in its recurrent classes. If
        one does not exist, Return: $\mu^k$ with "not optimal"
10:   end if
11:   Set $k \leftarrow k + 1$
12: until $\mu^k$ with gain-bias pair satisfying (32) and Return:
    $\mu^k$ with "optimal"
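Alg. 1 repeatedly asks whether a candidate stationary policy has a state of $K_C$ in its recurrent classes. One way to test this, shown here as a rough sketch rather than the authors' procedure, is to compute the recurrent classes of the Markov chain induced by the policy and intersect each with $K_C$. The helper names are hypothetical, and this version runs one graph search per state, so it is simple rather than linear-time.

```python
def reachable(succ, start):
    """States reachable from start (including start) in the policy's transition graph."""
    seen, stack = {start}, [start]
    while stack:
        for j in succ[stack.pop()]:
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def recurrent_classes(succ):
    """A state is recurrent iff every state it can reach can reach it back."""
    reach = {i: reachable(succ, i) for i in succ}
    return {frozenset(reach[i]) for i in succ
            if all(i in reach[j] for j in reach[i])}

def every_recurrent_class_hits(succ, K):
    """True if each recurrent class contains at least one state of K."""
    return all(cls & K for cls in recurrent_classes(succ))

# Toy chain (made up): succ[i] lists states j with P(i, mu(i), j) > 0.
succ = {0: [1, 2], 1: [1], 2: [2]}
classes = recurrent_classes(succ)
only_one = every_recurrent_class_hits(succ, {1})
both = every_recurrent_class_hits(succ, {1, 2})
```

State 0 is transient in the toy chain, so only the two absorbing states form recurrent classes.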

If we obtain the optimal policy for each $C \in \mathcal{A}$, then
we use (31) to obtain the optimal solution for Prob. III.1.
If for some $C$, Alg. 1 returns "not optimal", then the policy
returned by Alg. 1 is only sub-optimal. We can then apply
this algorithm to each AMEC in $\mathcal{A}$ and use (31) to obtain
a sub-optimal solution for Prob. III.1. Note that similar to
policy iteration algorithms for ACPS problems, either the
gain or the bias strictly decreases every time when $\mu$ is
updated, so policy $\mu$ is improved in each iteration. In both
cases, the satisfaction constraint is always enforced.
Remark V.6 (Complexity). The complexity of our proposed
algorithm is dictated by the size of the generated MDPs.
We use $|\cdot|$ to denote cardinality of a set. The size of the
DRA ($|Q|$) is, in the worst case, doubly exponential with
respect to the length of $\phi$. However, empirical studies such as [20] have
shown that in practice, the sizes of the DRAs for many LTL
formulas are generally much lower and manageable. The
size of product MDP $\mathcal{P}$ is at most $|S| \times |Q|$. The complexity
for the algorithm generating AMECs is at most quadratic in
the size of $\mathcal{P}$ [13]. The complexity of Alg. 1 depends on the
size of $C$. The policy evaluation (step 3) requires solving a
system of $3 \times |S_C|$ linear equations with $3 \times |S_C|$ unknowns.
The optimization steps (steps 4 and 6) each require at most
$|U_C| \times |S_C|$ evaluations. Checking the recurrent classes of $\mu$
is linear in $|S_C|$. Therefore, assuming that $|U_C|$ is dominated
by $|S_C|^2$ (which is usually true) and the number of policies
satisfying (35) and (36) for all $i$ is also dominated by $|S_C|^2$,
for each iteration, the computational complexity is $O(|S_C|^3)$.

VI. Case study 

The algorithmic framework developed in this paper is
implemented in MATLAB, and here we provide an example
as a case study. Consider the MDP $\mathcal{M}$ shown in Fig. 4,
which can be viewed as the dynamics of a robot navigating
in an environment with the set of atomic propositions
{pickup, dropoff}. In practice, this MDP can be obtained
via an abstraction process (see [1]) from the environment,
where its probabilities of transitions can be obtained from
experimental data or accurate simulations.




Fig. 4. MDP capturing a robot navigating in an environment. $\{\alpha, \beta, \gamma\}$ is
the set of controls at states. The cost of applying $\alpha$, $\beta$, $\gamma$ at a state where the
control is available is 5, 10, 1, respectively (e.g., $g(i, \alpha) = 5$ if $\alpha \in U(i)$).

The goal of the robot is to continuously perform a pickup-delivery
task. The robot is required to pick up items at the
state marked by pickup (see Fig. 4), and drop them off at
the state marked by dropoff. It is then required to go back
to pickup and this process is repeated. This task can be
written as the following LTL formula:

\[ \phi = \Box\Diamond\, \mathtt{pickup} \wedge \Box\big(\mathtt{pickup} \Rightarrow \bigcirc(\neg \mathtt{pickup}\ \mathcal{U}\ \mathtt{dropoff})\big). \]

The first part of $\phi$, $\Box\Diamond\, \mathtt{pickup}$, enforces that the robot
repeatedly picks up items. The remaining part of $\phi$ ensures
that new items cannot be picked up until the current items
are dropped off. We denote pickup as the optimizing
proposition, and the goal is to find a policy that satisfies
$\phi$ with probability 1 and minimizes the expected cost in
between visiting the pickup state (i.e., we aim to minimize
the expected cost in between picking up items).
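To make the objective concrete, the sketch below computes an empirical cost per cycle from one finite run, where a cycle is closed each time the run arrives at a state satisfying the optimizing proposition. Counting arrivals from the second state onward is an assumption of this example, as are the function name and the toy run; the costs 5, 1 and 10 are only borrowed from the control costs in Fig. 4 for flavor.

```python
def average_cost_per_cycle(states, stage_costs, optimizing_states):
    """Total cost of a finite run divided by the number of completed cycles.

    states      -> visited states s_0, ..., s_N
    stage_costs -> costs g(s_k, u_k) for k = 0, ..., N-1
    """
    if len(stage_costs) != len(states) - 1:
        raise ValueError("need exactly one stage cost per transition")
    # A cycle completes on each arrival at an optimizing state (s_0 is not an arrival).
    cycles = sum(1 for s in states[1:] if s in optimizing_states)
    if cycles == 0:
        raise ValueError("the run completes no cycle")
    return sum(stage_costs) / cycles

# Toy run (made up): two pickup-to-pickup cycles.
run = ["pickup", "hall", "dropoff", "pickup", "hall", "dropoff", "pickup"]
costs = [5, 1, 10, 5, 1, 10]
acpc = average_cost_per_cycle(run, costs, {"pickup"})
```

Here the total cost is 32 over two completed cycles, giving 16 per cycle; the ACPC in the text is the expected long-run limit of this ratio rather than a single-run value.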

We generated the DRA $\mathcal{R}_\phi$ using the ltl2dstar tool
[21] with 13 states and 1 pair $(L, K) \in F$. The product
MDP $\mathcal{P}$ after removing unreachable states contains 31
states (note that $\mathcal{P}$ has 130 states without removing
unreachable states). There is one AMEC $C$ corresponding
to the only pair in $F_\mathcal{P}$ and it contains 20 states. We tested
Alg. 1 with a number of different initial policies and
Alg. 1 produced the optimal policy within 2 or 3 policy
updates in each case (note that $C$ is not unichain). For one
initial policy, the ACPC was initially 330 at each state
of $C$, and it was reduced to 62.4 at each state when the
optimal policy was found. The optimal policy is as follows:



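State         | 0        | 1        | 2        | 3        | 4        | 5        | 6        | 7        | 8        | 9
After pickup  | $\alpha$ | $\beta$  | $\alpha$ | $\alpha$ | $\alpha$ | $\gamma$ | $\gamma$ | $\alpha$ |          | $\alpha$
After dropoff | $\alpha$ | $\alpha$ | $\alpha$ | $\alpha$ | $\alpha$ | $\alpha$ | $\gamma$ | $\alpha$ | $\alpha$ | $\alpha$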



The first row of the above table shows the policy after 
pick-up but before drop-off and the second row shows the 
policy after drop-off and before another pick-up. 



VII. Conclusions 

We have developed a method to automatically generate a 
control policy for a dynamical system modelled as a Markov 
Decision Process (MDP), in order to satisfy specifications 
given as Linear Temporal Logic formulas. The control policy 
satisfies the given specification almost surely, if such a policy 
exists. In addition, the policy optimizes the average cost 
between satisfying instances of an "optimizing proposition", 
under some conditions. The problem is motivated by robotic 
applications requiring persistent tasks to be performed such 
as environmental monitoring or data gathering. 

We are currently pursuing several future directions. First,
we aim to solve the problem completely and find an algorithm
that is guaranteed to always return the optimal policy.
Second, we are interested in applying the optimization criterion
of average cost per cycle to more complex models such as
Partially Observable MDPs (POMDPs) and semi-MDPs.